#setting options for our code chunks
knitr::opts_chunk$set(message = FALSE, warning = FALSE, tidy = TRUE)


Introduction

The purpose of this project is to create a model for predicting the rating of an Airbnb based on some of its characteristics.


What is (an) Airbnb?

Airbnb is a vacation rental company that allows users to book privately owned residences to stay at overnight. The type of residence can vary from booking to booking: some are homes, others are rooms, some are apartments, and others offer unique experiences such as tents or RVs. These residences are often collectively referred to as ‘Airbnbs’.

Take a look at the website here.

The website itself is easy to navigate. Users can input the locations they are looking for, sort bookings by price, or even specify the type of experience they would like. Once a user picks a specific Airbnb they are interested in, they are able to look at more specific information about it, including its rating, reviews, amenities, number of rooms, type of rooms, and of course the dates it's available to book.


Why might this model be useful?

Like most other products, those with better ratings are often of more interest to consumers! If a host is looking to upload a new Airbnb, can they get a glimpse into how satisfied customers will be with their stay? Our model can help them make changes to the property that might ensure a better stay for their visitors.


Loading Data and Packages

First, we will load in our packages and data. The data comes from this Kaggle dataset that was updated about a year ago. It contains data on over 250,000 Airbnbs from ten cities (and additionally a separate dataset containing reviews for Airbnbs).

The full codebook is available to look at in the data folder, but some of the variables we will be looking at include:

  • listing_id - the Listing ID

  • host_id - the host ID

  • host_since - the date the Host joined Airbnb

  • host_response_rate - percentage of times the Host responds

  • host_acceptance_rate - percentage of times the Host accepts a booking request

  • host_is_superhost - a binary field to determine if the Host is a Superhost

  • host_total_listings_count - the total number of listings the Host has on Airbnb

  • host_has_profile_pic - a binary field to determine if the Host has a profile picture

  • host_identity_verified - a binary field to determine if the Host has a verified identity

  • property_type - the Listing property type

  • room_type - the Listing's room type

  • amenities - a list of amenities the Listing includes

  • review_scores_rating - the Listing’s overall rating out of 100

# loading in our packages
library(tidymodels)
library(tidyverse)
library(ggplot2)
library(maps)
library(corrplot)
library(forcats)
library(corrr)
library(yardstick)
library(sf)
library(mapview)
library(dplyr)
library(stringr)
library(janitor)
library(lubridate)
library(MASS)
library(knitr)
tidymodels_prefer()
listings <- read.csv('/Users/Sofia/Desktop/PSTAT131/Final Project/Airbnb Data/Listings.csv')

We have a grand total of 279,712 observations and 33 variables! This is a lot of data to wrangle, but we will do our best to clean out as much of it as we can.


Preprocessing

Next, we will transform our data so that it is easier to work with when we begin analyzing it!

We’ll start off by removing some of the more inconsistent variables like the name of the Airbnb, host_location, neighbourhood, and district.

complete_listings <- listings %>% 
  select(-name, -host_location, -neighbourhood, -district)

This brings us down to 29 variables! Now, as we take a look at our observations, we can see that many of them have multiple missing entries. We will remove any incomplete cases from the dataset:

complete_listings <- complete_listings[complete.cases(complete_listings), ]

This narrows down our dataset to 93,324 observations which is about a third of the data we had started with. This is still plenty of data to work with!


Cleaning Out the Amenities List

One of our variables is called amenities and contains a list of amenities for each entry: this is likely very valuable information, but we cannot work with the data as it is. Instead, we will turn the most common amenities into dummy variables for each Airbnb.

We start off by creating a separate dataframe with just listing_id and amenities:

amenities_list <- complete_listings %>% select(c("listing_id", "amenities"))

We then apply a function that cleans our text data and unlists each value in the list of amenities. We get a resulting dataframe with two columns: the listing IDs and each corresponding amenity as a separate entry. There are over two million amenity entries across our 93,324 Airbnbs!

amenitiesFunction <- function(n){
  output <- n %>% 
    str_to_lower() %>%                      # make all amenities lowercase
    str_split(",") %>%                      # split each list at a comma
    unlist() %>%                            # unlist the amenities
    str_replace_all("[[:punct:]]", "") %>%  # get rid of any punctuation
    str_trim() %>%                          # trim white spaces in amenities
    tibble()                                # display as tibble
return(output)
}

# new column with our cleaned amenities
amenities_list$amenities_new <- map(.x = amenities_list$amenities, 
                                       .f = amenitiesFunction) 

# unnest to see each amenity with the corresponding listing
amenities_list <- unnest(amenities_list, cols = amenities_new)

# remove the old column and rename the new amenities
amenities_list <- amenities_list %>% select(!amenities)
colnames(amenities_list)[2] = 'new_amenities'

# save our file
save(amenities_list, file = 'bigFiles/amenities_list')

Since there is a lot of variation among the amenities, our next step is to filter out any infrequent ones, categorize them as Other, and reformat the result to fit the rest of our listing data:

# load file
load(file = 'bigFiles/amenities_list')

amenities_list2 <- amenities_list %>%
  mutate(new_amenities = fct_lump(new_amenities, prop = .01)) %>% # classify others
  mutate(row = row_number()) %>%  # mutate a column with row counts
  pivot_wider(names_from = new_amenities, values_from = row) # pivot wider

Unfortunately, because there are multiple counts of Other in some of the entries, pivot_wider returns lists of values. To bypass this, I created a new data frame containing the length of each list, which tells us whether a listing does or does not have a certain amenity:
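As a minimal illustration of that list-column behavior (toy data, not our listings):

```r
library(tidyverse)

# pivot_wider() produces list-columns when a name repeats for the same id,
# and lengths() reveals which cells are empty
toy <- tibble(listing_id = c(1, 1, 2),
              amenity    = c("Other", "Other", "wifi"),
              row        = 1:3)

wide <- pivot_wider(toy, names_from = amenity, values_from = row)
lengths(wide$Other)                        # 2 0 -> listing 1 has it, listing 2 doesn't
ifelse(lengths(wide$Other) == 0, 'f', 't') # "t" "f"
```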

amenities_list3 <- tibble(.rows = 93324) # empty tibble
amenities_list3$listing_id <- amenities_list2$listing_id # add in listing_id

# loop over each amenity column and flag empty lists
for (n in amenities_list2[2:38]) { 
  n1 <- ifelse(lengths(n) == 0, 'f', 't') # 'f' if the listing lacks the amenity
  amenities_list3[ , ncol(amenities_list3) + 1] <- n1 # append as a new column
}

# append correct column names
colnames(amenities_list3) <- colnames(amenities_list2) 

#save file
write_rds(amenities_list3, file = 'bigFiles/amenities_list_final')

PHEW! That was a doozy. Now, let’s append these amenities to our original data frame:

# load file
amenities_final <- read_rds(file = 'bigFiles/amenities_list_final')

# bind amenities tibble and remove old amenities variable
complete_listings <- complete_listings %>% 
  cbind(amenities_final[2:38]) %>% 
  select(-amenities)

head(complete_listings) %>% as_tibble()

That’s really nice to look at! However, we still have some things to take care of…


Factors and Cleaning Names

We will also categorize rarer property types as Other in our main dataframe:

complete_listings <- complete_listings %>%
  mutate(property_type = fct_lump(property_type, prop = .01))

Then, we will create a column that contains the date the user became a Host in lubridate format, and, for simplicity, another that contains only the year they joined:

complete_listings <- complete_listings %>%
  mutate(date = parse_date(host_since, "%Y-%m-%d"),
    host_since_year = year(date))

We will make the year into a factor as well:

complete_listings$host_since_year <- complete_listings$host_since_year %>%
  as.factor()

And then we will transform all the variables that are either TRUE or FALSE into factors (including our amenities).

factorFunction <- function(n) {
  n <- factor(n, levels = c("t", "f"))
return(n)}
complete_listings[c(7,9,10,28:65)] <- lapply(complete_listings[c(7,9,10,28:65)], factorFunction)

We’re almost done with the pre-processing of our data! Finally, we’re going to clean our variable names to make them easier to work with:

complete_listings <- complete_listings %>% clean_names()

… and now we can begin work on our models!


Model Set-Up

To set up our models, we will be splitting our data, exploring it, creating a recipe, and splitting it once again into cross-validation folds.

Splitting Data

To determine how successful our models are, we will be splitting our data: we will use some of it to train our models and we will use the remaining parts to test it. We set our seed (for replicability), make our split ratio 80/20, and split on ratings so we have similar distributions in both training and testing:

set.seed(128)

abnb_split <- initial_split(complete_listings, prop = 0.80,
                                strata = review_scores_rating)
abnb_tr <- training(abnb_split)
abnb_te <- testing(abnb_split)
abnb_split
## <Training/Testing/Total>
## <74658/18666/93324>

In our training set, we will be working with 74,658 Airbnbs!


Exploring the Data

Now we finally get to take a look at the shape of our data!

Let's start by taking a look at the locations of our listings around the world:

# making the map
tr_map <- mapview(abnb_tr, xcol = "longitude", ycol = "latitude", 
                  crs = 4326, grid = FALSE)

# save
save(tr_map, file = 'bigFiles/tr_map')
# refining options
options(mapviewMaxPixels = 10000)
mapviewOptions(maxpoints = 2000, maxpolygons = 2000, maxlines = 2000)

#loading
load(file='bigFiles/tr_map')
tr_map

It's clear that we only have listings from ten cities: Bangkok, Cape Town, Hong Kong, Istanbul, Mexico City, New York, Paris, Rio de Janeiro, Rome, and Sydney.

Let's see how the rating distributions vary among these cities:

ggplot(abnb_tr, aes(review_scores_rating)) +
  geom_histogram(fill = "indianred2", binwidth = 2) +
  facet_wrap(~city, scales = "free_y") +
  labs(
    title = "Histogram of Reviews by City"
  )

The distributions look pretty similar in each city! Most of the cities except for Rome show a large spike in perfect scores.

Here we can look at the overall distribution of ratings among all Airbnbs in our data:

ggplot(abnb_tr, aes(review_scores_rating)) +
  geom_histogram(bins = 60, fill = "indianred2") +
  labs(
    title = "Histogram of Reviews"
  )

We see an upward trend in ratings, but we have to keep in mind that we do not know how many total reviews each listing has received! Overall, people tend to be pretty satisfied with their stays, though we do see small spikes that indicate some variation.

Now, let's look at the distribution of ratings by property type, since that may play a large part in what customers are satisfied with:

ggplot(abnb_tr, aes(review_scores_rating)) +
  geom_histogram(fill = "indianred2") +
  facet_wrap(~property_type, scales = "free_y") +
  labs(
    title = "Histogram of Reviews by Property Type"
  )

Most of our graphs seem to show similar patterns, though there are subtle differences between them. For example, Airbnbs classified as rooms in hotels have a bit of a spike around the 80- and 90-point review marks, while Airbnbs such as entire guest suites show a fairly consistent exponential rise in reviews. This makes sense since the two property types are at different price points, so customers are, on average, paying for a better experience.

Speaking of pricing, we can take a look at the correlations between price and review_scores_rating for each type of property:

ggplot(abnb_tr, aes(review_scores_rating, price)) +
  geom_point(alpha = 0.1, colour = 'indianred2') +
  geom_smooth(se = FALSE, color = "black", size = 1) +
  facet_wrap(~property_type, scales = "free_y") +
  labs(
    title = "Reviews versus Price by Property Type"
  )

As can be expected, pricing and review scores do tend to show a slight positive correlation, though a majority of the bookings stay on the less expensive end. Private rooms in bed and breakfasts and entire guest suites show some unusual curves, but they are also among the property types with the fewest reviews.

Out of curiosity, I am also going to take a look at the distribution of reviews for a random amenity. I’m personally curious about whether having hot water would or would not affect the rating of an Airbnb:

ggplot(abnb_tr, aes(review_scores_rating)) +
  geom_bar(aes(fill = hot_water)) +
  scale_fill_manual(values = c("indianred2", "skyblue"))

Seems like people don’t mind not having hot water! Or at least they don’t expect to have it if it’s not included within the amenities. Nevertheless, it’s fun to look at!

Finally, let’s take a general look at how much each non-character variable correlates with the others:

# get the names of the numeric columns
numeric_col <-
  abnb_tr %>%
  select_if(is.numeric) %>%
  colnames()

# compute the correlation matrix
numabnb <- select(abnb_tr, all_of(numeric_col)) %>%
  cor()

# correlation plot (cor() returns correlations, so is.corr stays TRUE)
corrplot(numabnb, type = "lower", tl.col = 'black')

We do not see much correlation between most of these variables aside from review_scores_rating and the subsequent breakdowns of the scores. These correlations make sense given that the overall review score is based on the makeup of the other scores! We can also see that the number of bedrooms correlates with the number of people the Airbnb can accommodate, which also makes sense. Price seems to have a slight positive correlation with both of those variables as well. Finally, we see a correlation between latitude and longitude (this also makes sense since we are only looking at a select few cities) as well as a correlation between host_id and listing_id. Though I'm not sure about the inner workings of the Airbnb system, it would make sense to assume that the listing_id is generated using the host_id.

Now that we know a little bit more about what we’re working with, let’s finish up setting up our models!


Creating a Recipe

In order to build our models, we will create a single recipe that tells each model how to use the data we are giving it and what we are trying to predict: in this case, the review_scores_rating. Here, I am choosing to exclude a few of the variables from our data set. I will be excluding:

  • latitude and longitude, since in-between values won't mean much to us
  • listing_id and host_id since they are random
  • host_since and date since we will be using host_since_year
  • any other variables starting with review since they will directly correlate to the total rating

Then I will be creating dummy variables for all the nominal variables as well as normalizing all the predictors:

abnb_recipe <- recipe(review_scores_rating ~ host_response_time +
                        host_response_rate + host_acceptance_rate + 
                        host_is_superhost + host_total_listings_count + 
                        host_has_profile_pic + host_identity_verified + city +
                        property_type + room_type + accommodates + bedrooms +
                        price + minimum_nights + maximum_nights + 
                        instant_bookable + shampoo + dishes_and_silverware + 
                        heating + iron + kitchen + hair_dryer + essentials +
                        washer + bed_linens + refrigerator + hot_water + oven +
                        wifi + cooking_basics + long_term_stays_allowed +
                        dedicated_workspace + elevator + hangers + coffee_maker +
                        carbon_monoxide_alarm + smoke_alarm + other + microwave + 
                        air_conditioning + free_street_parking + dryer + 
                        fire_extinguisher + extra_pillows_and_blankets + tv +
                        cable_tv + first_aid_kit + private_entrance + 
                        luggage_dropoff_allowed + free_parking_on_premises + 
                        stove + host_greets_you + patio_or_balcony + 
                        host_since_year, data = abnb_tr) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_predictors()) %>% 
  prep()

bake(abnb_recipe, new_data = NULL)

We have quite the recipe here with 92 variables! Hopefully my computer will be able to process all this data that we are about to analyze.


Cross-Validation

In our last step before we finally build our models, we will be splitting our training data into folds. This step will provide us with mini training and testing sets that we can use to train our models and see which one performs best. This step increases our likelihood of getting better results in our actual testing set. I will only be creating three folds because of the size of the data set to reduce the run time for our models:

set.seed(128)
abnb_folds <- vfold_cv(abnb_tr, v = 3, strata = review_scores_rating)

Then, I’m going to save the training data, folds, and recipe in order to employ them later while building the actual models:

save(abnb_tr, abnb_folds, abnb_recipe, file = "model_basics/model_setup.rda")


Model Building

We will be constructing four models to see which one will help us predict review score ratings the best. The models we will use are:

  • Linear Regression
  • Regularized Regression
  • Boosted Trees
  • Random Forest


Linear Regression

To start we will clarify that we will be conducting a linear regression analysis and set our engine to ‘lm’:

lm_model <- linear_reg() %>% 
  set_mode("regression") %>%
  set_engine("lm")

Then we will set up our workflow:

lm_wflow <- workflow() %>%    #empty workflow
  add_model(lm_model) %>%     #add model
  add_recipe(abnb_recipe)     #add recipe

Now we will fit our model to our training data and save our results:

lm_fit <- fit_resamples(lm_wflow, abnb_folds)

write_rds(lm_fit, file = 'results/lm_results')

And that’s it for linear regression! Onto the next model.


Regularized Regression

Now we will be tuning some of our regression parameters using an elastic net that will hopefully give us more accurate results than the linear regression model! We begin by setting up the model and the engine like we did for the linear regression model:

elastic_net_spec <- linear_reg(penalty = tune(),
                               mixture = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("glmnet")
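For reference, this is the standard glmnet parameterization (not something specific to our data): penalty is the total amount of regularization, and mixture is the proportion of lasso penalty, with 0 being pure ridge and 1 pure lasso. We can peek at how tidymodels passes these along:

```r
library(tidymodels)

# penalty = lambda (amount of regularization), mixture = alpha (0 = ridge, 1 = lasso)
linear_reg(penalty = 0.01, mixture = 0.5) %>%
  set_engine("glmnet") %>%
  translate()
```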

Then we create our workflow:

en_workflow <- workflow() %>% 
  add_recipe(abnb_recipe) %>%
  add_model(elastic_net_spec)

And our grid:

en_grid <- grid_regular(penalty(range = c(-5,5)),
                        mixture(range = c(0,1)), 
                        levels = 10)

And we finish off by tuning the model itself:

tune_reg_reg <- tune_grid(
  en_workflow,
  resamples = abnb_folds, 
  grid = en_grid)

write_rds(tune_reg_reg, file = 'results/regreg_results')


Boosted Trees

With this being our third model, we’re starting to get into the groove of things. To make our boosted trees we will be setting up the model:

boosted_tree_spec <- boost_tree() %>% 
  set_engine("xgboost") %>% 
  set_mode("regression") %>% 
  set_args(trees = tune(),
           learn_rate = tune())

Setting up the workflow:

boosted_tree_wf <- workflow() %>% 
  add_model(boosted_tree_spec) %>% 
  add_recipe(abnb_recipe)

Making the tuning grid with ten levels of trees and learning rate:

param_grid_boosted_tree <- grid_regular(trees(range = c(10, 300)),
                                        learn_rate(range = c(-10, -1)),
                                        levels = 10)

And then tuning that same grid and saving it:

tune_boosted_tree <- tune_grid(
  boosted_tree_wf,
  resamples = abnb_folds,
  grid = param_grid_boosted_tree,
  metrics = metric_set(rmse, rsq)
)

write_rds(tune_boosted_tree, file = 'results/bt_results')


Random Forest

Our final model is our most complex and long-winded: the random forest. We will follow the same steps to set it up, starting with the engine and the model:

forest_spec <- rand_forest() %>% 
  set_engine("ranger", importance = 'impurity') %>% 
  set_mode("regression") %>% 
  set_args(mtry = tune(),
           trees = tune(),
           min_n = tune())

Then the workflow:

forest_wf <- workflow() %>% 
  add_model(forest_spec) %>% 
  add_recipe(abnb_recipe)

Next up is our grid. However, we have to be careful when making it, since the parameters can greatly affect our run time. We will use a tree range of 100-500, and eight levels each for the number of randomly selected predictors (mtry) and the minimum node size (min_n):

param_grid_forest <- grid_regular(mtry(range = c(1,10)), min_n(range = c(5,20)),
                           trees(range = c(100, 500)), levels = 8)

And after all that we finish off by tuning the model:

tune_forest <- tune_grid(
  forest_wf,
  resamples = abnb_folds,
  grid = param_grid_forest,
  metrics = metric_set(rmse, rsq)
)

write_rds(tune_forest, file = 'results/rf_results')


Model Results

Now that we have run our models, it is time to analyze our results. We're going to start by loading in our data:

lm_results <- read_rds('results/lm_results')
regreg_results <- read_rds('results/regreg_results')
bt_results <- read_rds('results/bt_results')
rf_results <- read_rds('results/rf_results')

Since we are analyzing a continuous regression problem, we will be taking a look at two measurements. The first is the root mean squared error (rmse), which indicates the typical difference between the true and the predicted value. The second is R-squared (rsq), which indicates the proportion of variance in our outcome variable that can be explained by the predictors. Our goal is to minimize the rmse and to maximize the rsq. Let's see how well our models did!
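As a toy illustration of the two metrics with yardstick (made-up numbers, not our Airbnb data):

```r
library(yardstick)
library(tibble)

# four hypothetical listings: true ratings and model predictions
toy <- tibble(truth    = c(90, 80, 100, 70),
              estimate = c(85, 82, 95, 78))

rmse(toy, truth = truth, estimate = estimate)  # typical size of a prediction error
rsq(toy, truth = truth, estimate = estimate)   # proportion of variance explained
```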


Linear Regression Results

Our simplest model was the linear regression model. Let’s take a look at the rmse and rsq:

lm <- collect_metrics(lm_results)
lm

…Not the best! We have an average difference of 8.89 points between listings' true and predicted review scores ratings, and we could only explain about 10.5% of the variance in our outcome. However, this is not unusual to see in linear regression, and it suggests that the relationship between our predictors and the outcome is probably not linear.


Regularized Regression Results

Knowing our results for the linear regression model, we know we might see pretty similar results in our elastic net models! Let’s take a look at our autoplots:

autoplot(regreg_results)

The models did about the same as our linear regression model! The elastic net did slightly better on both rmse and rsq with lower amounts of regularization and a lower lasso proportion. Let's look at which specific values gave us the best rsq and rmse:

regreg <- regreg_results %>% 
  collect_metrics() %>% # collect metrics
  subset(.metric == "rmse") %>% # select rmse
  arrange(mean) %>% # arrange by ascending
  slice(1) # select the lowest one

regreg <- rbind(regreg, regreg_results %>% # bind to rmse results
  collect_metrics() %>% # collect metrics
  subset(.metric == "rsq") %>% # select rsq
  arrange(desc(mean)) %>% # arrange by descending
  slice(1)) # select the highest one

regreg

Pretty similar to the linear regression! Hopefully we will be able to see some improvements in the boosted tree models.


Boosted Trees Results

Here the models saw a substantial jump in run time. It took a little under an hour to fit all the boosted tree models to the data. Let’s hope to see some improvement, and take a look at the autoplot:

autoplot(bt_results)

Our models did have a slight improvement! Here we can see that lower learning rates seemed to have better output both in rmse and rsq. Though they are not the best of results, it is good to see improvement! Let’s take a look at which specific boosted tree models gave us the best rmse and rsq:

bt <- bt_results %>%
  collect_metrics() %>% # collect metrics
  subset(.metric == "rmse") %>% # select rmse
  arrange(mean) %>% # arrange by ascending
  slice(1) # select the lowest one

bt <- rbind(bt, bt_results %>% # bind to rmse results
  collect_metrics() %>% # collect metrics
  subset(.metric == "rsq") %>% # select rsq
  arrange(desc(mean)) %>% # arrange by descending
  slice(1)) # select the highest one

bt

Here we can see that the same exact model gave us the best rmse and rsq. It had 106 trees and a learning rate of 0.1! We see a decrease in rmse by about 0.21 points and an increase in rsq by 4%! Let’s see if our random forest can do better.


Random Forest Results

The random forest model had by far the longest run time. It crashed my laptop after almost 24 hours of tuning, and took about nine and a half hours to run on a different laptop. After all that, let's take a look at an autoplot of the results:

autoplot(rf_results)

Though they look pretty similar, we can see that models with more trees did better than those with fewer, and that more randomly selected predictors and larger minimum node sizes also gave us better values. Like before, let's look at the specific values of the best models:

rf <- rf_results %>% 
  collect_metrics() %>% # collect metrics
  subset(.metric == "rmse") %>% # select rmse
  arrange(mean) %>% # arrange by ascending
  slice(1) # select the lowest one

rf <- rbind(rf, rf_results %>% # bind to rmse results
  collect_metrics() %>% # collect metrics
  subset(.metric == "rsq") %>% # select rsq
  arrange(desc(mean)) %>% # arrange by descending
  slice(1)) # select the highest one

rf

We saw a slight increase in our model this time as well! Our rmse went down by 0.08 and our rsq went up by 1.6%. Though still pretty low, it is nice to see our rmse and rsq improve!

We can also take a look at the variable importance plot for our rmse to see which variables were split on most often:

load('results/rf_vip')
rf_vip

The price of the Airbnb, whether the Host is a Superhost, and the total number of listings a Host has seem to be the most decisive variables! It makes sense that the price of the Airbnb and a more experienced Host would ensure a better stay for the occupants.

Now, having run all of our models, we will choose the best one and fit it to our testing set!


Choosing a Model

Let's compare our models side by side to choose which one we will fit to our testing set. First let's look at rmse:

# append rmses from each model
rmses <- c(lm %>% subset(.metric == "rmse") %>% select(mean), 
           regreg %>% subset(.metric == "rmse") %>% select(mean), 
           bt %>% subset(.metric == "rmse") %>% select(mean), 
           rf %>% subset(.metric == "rmse") %>% select(mean))

# name each model
models <- c("Linear Regression", 
            "Regularized Regression", 
            "Boosted Tree", 
            "Random Forest")

# combine
results <- tibble(models = models, rmse = rmses)

# unnest and arrange
results %>% 
  unnest(cols = rmse) %>% 
  arrange(rmse)

Here we can see that the Random Forest model performed the best with the lowest rmse! Let’s see if it is any different for the rsq:

# append rsqs from each model
rsqs <- c(lm %>% subset(.metric == "rsq") %>% select(mean), 
           regreg %>% subset(.metric == "rsq") %>% select(mean), 
           bt %>% subset(.metric == "rsq") %>% select(mean), 
           rf %>% subset(.metric == "rsq") %>% select(mean))

# name each model
models <- c("Linear Regression", 
            "Regularized Regression", 
            "Boosted Tree", 
            "Random Forest")

# combine
results <- tibble(models = models, rsq = rsqs)

# unnest and arrange
results %>% 
  unnest(cols = rsq)%>% 
  arrange(desc(rsq))

We can see here that the Random Forest had the best results yet again!

After taking a quick peek, our best forest had a noticeably lower rmse than the other models while also retaining the highest rsq, making it our best contender! With 10 randomly selected predictors, a minimum node size of 17, and 500 trees, we will be fitting this random forest to our testing data!


Fitting to the Testing Set

Now that we’ve decided on a model, we will fit our random forest to the testing data. To do that, we will finalize our workflow, and finally fit the model!

final_rf_wf <- forest_wf %>% 
  finalize_workflow(select_best(rf_results, "rmse"))   #finalizing workflow

final_fit <- fit(final_rf_wf, data = abnb_te)         #fitting to test set

Now that the data is fit, we can use it to predict values for our listings! Let’s cross our fingers and hope for the best:

# combine actual values and predictions
preds <- augment(final_fit, new_data = abnb_te) %>% 
  select(review_scores_rating, starts_with(".pred"))

preds %>% head()

Looking at these values, we can see that there is a little bit of a difference between the actual values and our predictions. However, the model is still able to distinguish which listings should be rated higher and which should be rated lower!

For the grand reveal let’s look at our rmse and rsq:

final_metrics <- rmse(preds, truth = review_scores_rating, estimate = .pred) %>% 
  rbind(rsq(preds, truth = review_scores_rating, estimate = .pred))

final_metrics

WOW! Now that is a relief to look at. Our rmse went from 8.61 down to 6.19 and our rsq went up from 0.16 to 0.70! A jump this large should be taken with a grain of salt, though: since we fit the final model directly on the testing data rather than refitting on the training set and only predicting on the held-out test set, these metrics are likely optimistic.

Let’s look at the listings that had the biggest differences between actual values and predictions:

augment(final_fit, new_data = abnb_te) %>% 
  mutate(diff = abs(review_scores_rating - .pred)) %>% 
  filter(diff >= 20) %>% 
  select(review_scores_rating, .pred) %>% 
  ggplot(aes(x = review_scores_rating, y = .pred)) +
  geom_point(colour = "indianred2", alpha = 1) +
  labs(x = "Review Scores Rating", 
       y = "Predicted Value", 
       title = "Reviews with Differences Over 20 Points")

We can tell from this graph that our model does a poor job of predicting lower-rated Airbnbs. Looking more closely, we can see that many of the actual values fall on multiples of 20, specifically 20, 40, and 60. This is likely because these Airbnbs received very few ratings, so a single bad review produces an extreme score that our predictors cannot anticipate. In the future, if there is a possibility of adding a review count to the dataset, it may improve our metrics even further.
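The clustering on multiples of 20 makes sense under a simple assumption (this is a hypothesis, not something the dataset documents): if each individual review contributes a whole number of stars (1-5) scaled to a 100-point overall score, a listing with exactly one review can only land on a multiple of 20:

```r
# possible overall scores (out of 100) for a listing with a single
# whole-star review, assuming each star is worth 20 points
(1:5) * 20
# 20 40 60 80 100
```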


Looking at Predictions

Now that we know a little more about our overall performance, let's take a look at some specific Airbnbs and whether their ratings have changed since our data was collected! Airbnb has since switched to a star-based rating system, so we will multiply the star rating by 20 to get the equivalent on our 100-point scale.
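For example (hypothetical star values, just to show the conversion):

```r
# converting current star ratings to the 100-point scale used in our data
stars <- c(5.0, 4.8, 3.75)  # hypothetical star ratings
stars * 20
# 100 96 75
```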


Airbnb One

Our first Airbnb is an apartment in Cape Town with a review scores rating of 20. Our model predicted the rating to be 75.7.

augment(final_fit, new_data = abnb_te) %>% subset(listing_id == 21031952)

As we can see, there has only been one review of the property, which probably accounts for the poor rating! Though most of the information stayed the same, we can see that the Host changed the pricing of the Airbnb. Let's see what that does to our score:

abnb_test1 <- data.frame(host_response_time = 'within an hour',
                  host_response_rate = 1.0, host_acceptance_rate = 1.0,
                  host_is_superhost = as.factor('f'), 
                  host_total_listings_count = 1, 
                  host_has_profile_pic = as.factor('t'), 
                  host_identity_verified = as.factor('f'),
                  city = "Cape Town", property_type = 'Entire apartment',
                  room_type = "Entire place", accommodates = 2, bedrooms = 1,
                  price = 49, minimum_nights = 1, maximum_nights = 1125,
                  instant_bookable = as.factor('t'), shampoo = as.factor('t'), 
                  dishes_and_silverware = as.factor('f'), 
                  heating = as.factor('t'), iron = as.factor('f'), 
                  kitchen = as.factor('t'), hair_dryer = as.factor('f'),
                  essentials = as.factor('t'), washer = as.factor('f'), 
                  bed_linens = as.factor('t'), refrigerator = as.factor('f'),
                  hot_water = as.factor('t'), oven = as.factor('f'), 
                  wifi = as.factor('t'), cooking_basics = as.factor('f'), 
                  long_term_stays_allowed = as.factor('t'),
                  dedicated_workspace = as.factor('t'), elevator = as.factor('f'), 
                  hangers = as.factor('t'), coffee_maker = as.factor('f'), 
                  carbon_monoxide_alarm = as.factor('f'), 
                  smoke_alarm = as.factor('f'), other = as.factor('t'), 
                  microwave = as.factor('f'), air_conditioning = as.factor('f'), 
                  free_street_parking = as.factor('f'), dryer = as.factor('f'),
                  fire_extinguisher = as.factor('f'), 
                  extra_pillows_and_blankets = as.factor('f'), tv = as.factor('t'),
                  cable_tv = as.factor('t'), first_aid_kit = as.factor('f'),
                  private_entrance = as.factor('t'), 
                  luggage_dropoff_allowed = as.factor('t'), 
                  free_parking_on_premises = as.factor('t'), 
                  stove = as.factor('f'), host_greets_you = as.factor('f'), 
                  patio_or_balcony = as.factor('f'), 
                  host_since_year = as.factor(2017))

predict(final_fit, new_data = abnb_test1)

Our score actually went up! This makes sense, since the Airbnb may be considered a better deal now that the price is lower. However, unfortunately for this Host, the Airbnb saw no other bookings, and so its review scores rating remained a 20.
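As an aside, retyping all fifty-odd columns for each scenario is tedious and error-prone. A lighter pattern is to copy the listing’s existing row and overwrite only the fields that changed; here is a minimal sketch with a toy row (in the report, this would be the matching row of `abnb_te`):

```r
# Toy stand-in for one row of the test set
toy_row <- data.frame(price = 60, minimum_nights = 1,
                      instant_bookable = as.factor('f'))

# Overwrite only the field the Host actually changed
new_row <- transform(toy_row, price = 49)
new_row$price           # 49
new_row$minimum_nights  # unchanged: 1
```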


Airbnb Two

Our next Airbnb is an apartment in Paris. It was rated a 93 in our dataset, and our model predicted it to be 95.6:

augment(final_fit, new_data = abnb_te) %>% subset(listing_id == 5625129)

Its current rating on the website is 4.60 stars, the equivalent of a 92 on our scale. This Host also changed the pricing of their Airbnb, as well as the minimum nights required, so we will update the prediction to reflect these changes:

abnb_test2 <- data.frame(host_response_time = 'within a few hours',
                  host_response_rate = 1.0, host_acceptance_rate = 1.0,
                  host_is_superhost = as.factor('t'), 
                  host_total_listings_count = 1, 
                  host_has_profile_pic = as.factor('t'), 
                  host_identity_verified = as.factor('t'),
                  city = "Paris", property_type = 'Entire apartment',
                  room_type = "Entire place", accommodates = 4, bedrooms = 1,
                  price = 157, minimum_nights = 4, maximum_nights = 1125,
                  instant_bookable = as.factor('f'), shampoo = as.factor('f'), 
                  dishes_and_silverware = as.factor('f'), 
                  heating = as.factor('t'), iron = as.factor('t'), 
                  kitchen = as.factor('t'), hair_dryer = as.factor('f'),
                  essentials = as.factor('t'), washer = as.factor('t'), 
                  bed_linens = as.factor('f'), refrigerator = as.factor('f'),
                  hot_water = as.factor('f'), oven = as.factor('f'), 
                  wifi = as.factor('t'), cooking_basics = as.factor('f'), 
                  long_term_stays_allowed = as.factor('t'),
                  dedicated_workspace = as.factor('f'), elevator = as.factor('f'), 
                  hangers = as.factor('t'), coffee_maker = as.factor('f'), 
                  carbon_monoxide_alarm = as.factor('f'), 
                  smoke_alarm = as.factor('t'), other = as.factor('f'), 
                  microwave = as.factor('f'), air_conditioning = as.factor('f'), 
                  free_street_parking = as.factor('f'), dryer = as.factor('f'),
                  fire_extinguisher = as.factor('f'), 
                  extra_pillows_and_blankets = as.factor('f'), tv = as.factor('t'),
                  cable_tv = as.factor('t'), first_aid_kit = as.factor('f'),
                  private_entrance = as.factor('f'), 
                  luggage_dropoff_allowed = as.factor('f'), 
                  free_parking_on_premises = as.factor('f'), 
                  stove = as.factor('f'), host_greets_you = as.factor('f'), 
                  patio_or_balcony = as.factor('f'), 
                  host_since_year = as.factor(2014))

predict(final_fit, new_data = abnb_test2)

The prediction stayed about the same for this Airbnb! At a predicted value of about 95.6 against a true rating of 92, it is roughly four points off. Not bad!


Airbnb Three

Let’s switch things up a little and look at a more unique property. This is a Roman country house that was classified as Other in our dataset:

augment(final_fit, new_data = abnb_te) %>% subset(listing_id == 2531491)

At the time, the Airbnb was rated a 99, and our model gave it a predicted score of 95.5. The Airbnb’s rating is now even closer to 100, as it currently has 4.98 stars on the website. Let’s see how different our prediction looks after updating the price, total listings, and a few amenities:

abnb_test3 <- data.frame(host_response_time = 'within a few hours',
                  host_response_rate = 1.0, host_acceptance_rate = 0.92,
                  host_is_superhost = as.factor('t'), 
                  host_total_listings_count = 3, 
                  host_has_profile_pic = as.factor('t'), 
                  host_identity_verified = as.factor('t'),
                  city = "Rome", property_type = 'Other',
                  room_type = "Entire place", accommodates = 16, bedrooms = 15,
                  price = 839, minimum_nights = 3, maximum_nights = 1125,
                  instant_bookable = as.factor('f'), shampoo = as.factor('t'), 
                  dishes_and_silverware = as.factor('f'), 
                  heating = as.factor('t'), iron = as.factor('t'), 
                  kitchen = as.factor('t'), hair_dryer = as.factor('t'),
                  essentials = as.factor('t'), washer = as.factor('t'), 
                  bed_linens = as.factor('t'), refrigerator = as.factor('t'),
                  hot_water = as.factor('t'), oven = as.factor('t'), 
                  wifi = as.factor('t'), cooking_basics = as.factor('f'), 
                  long_term_stays_allowed = as.factor('t'),
                  dedicated_workspace = as.factor('t'), elevator = as.factor('f'),
                  hangers = as.factor('t'), coffee_maker = as.factor('t'), 
                  carbon_monoxide_alarm = as.factor('f'), 
                  smoke_alarm = as.factor('f'), other = as.factor('t'), 
                  microwave = as.factor('f'), air_conditioning = as.factor('t'), 
                  free_street_parking = as.factor('f'), dryer = as.factor('t'),
                  fire_extinguisher = as.factor('t'), 
                  extra_pillows_and_blankets = as.factor('t'), tv = as.factor('t'),
                  cable_tv = as.factor('t'), first_aid_kit = as.factor('f'),
                  private_entrance = as.factor('t'), 
                  luggage_dropoff_allowed = as.factor('t'), 
                  free_parking_on_premises = as.factor('t'), 
                  stove = as.factor('t'), host_greets_you = as.factor('f'), 
                  patio_or_balcony = as.factor('f'), 
                  host_since_year = as.factor(2014))

predict(final_fit, new_data = abnb_test3)

Our estimate went down a little bit to 95.1. That is yet again a difference of about four points!


Airbnb Four

Finally, let’s take a look at an Airbnb that is not included in our dataset: a tiny home in Mexico City priced at a mere $23, with a rating of 4.82 stars, or 96.4 in our metrics. Let’s see how our model rates this Airbnb:

abnb_test4 <- data.frame(host_response_time = 'within an hour',
                  host_response_rate = 1.0, host_acceptance_rate = 1.0,
                  host_is_superhost = as.factor('f'), 
                  host_total_listings_count = 3, 
                  host_has_profile_pic = as.factor('t'), 
                  host_identity_verified = as.factor('t'),
                  city = "Mexico City", property_type = 'Other',
                  room_type = "Entire place", accommodates = 2, bedrooms = 1,
                  price = 23, minimum_nights = 2, maximum_nights = 1125,
                  instant_bookable = as.factor('f'), shampoo = as.factor('t'), 
                  dishes_and_silverware = as.factor('f'), 
                  heating = as.factor('t'), iron = as.factor('t'), 
                  kitchen = as.factor('t'), hair_dryer = as.factor('f'),
                  essentials = as.factor('t'), washer = as.factor('t'), 
                  bed_linens = as.factor('t'), refrigerator = as.factor('f'),
                  hot_water = as.factor('t'), oven = as.factor('f'), 
                  wifi = as.factor('t'), cooking_basics = as.factor('f'), 
                  long_term_stays_allowed = as.factor('t'),
                  dedicated_workspace = as.factor('t'), elevator = as.factor('f'), 
                  hangers = as.factor('t'), coffee_maker = as.factor('f'), 
                  carbon_monoxide_alarm = as.factor('f'), 
                  smoke_alarm = as.factor('f'), other = as.factor('t'), 
                  microwave = as.factor('f'), air_conditioning = as.factor('f'), 
                  free_street_parking = as.factor('t'), dryer = as.factor('t'),
                  fire_extinguisher = as.factor('f'), 
                  extra_pillows_and_blankets = as.factor('f'), tv = as.factor('t'),
                  cable_tv = as.factor('f'), first_aid_kit = as.factor('f'),
                  private_entrance = as.factor('t'), 
                  luggage_dropoff_allowed = as.factor('f'), 
                  free_parking_on_premises = as.factor('t'), 
                  stove = as.factor('f'), host_greets_you = as.factor('f'), 
                  patio_or_balcony = as.factor('f'), 
                  host_since_year = as.factor(2015))

predict(final_fit, new_data = abnb_test4)

Our model predicts that this Airbnb will be rated a 90.9, compared to the actual rating of 96.4: a difference of about five points! However, while looking at this listing I noticed that certain kitchen amenities were not included in the description even though they appeared in photos. Adjusting for those amenities, we get a new predicted rating:

abnb_test4_2 <- data.frame(host_response_time = 'within an hour',
                  host_response_rate = 1.0, host_acceptance_rate = 1.0,
                  host_is_superhost = as.factor('f'), 
                  host_total_listings_count = 3, 
                  host_has_profile_pic = as.factor('t'), 
                  host_identity_verified = as.factor('t'),
                  city = "Mexico City", property_type = 'Other',
                  room_type = "Entire place", accommodates = 2, bedrooms = 1,
                  price = 23, minimum_nights = 2, maximum_nights = 1125,
                  instant_bookable = as.factor('f'), shampoo = as.factor('t'), 
                  dishes_and_silverware = as.factor('f'), 
                  heating = as.factor('t'), iron = as.factor('t'), 
                  kitchen = as.factor('t'), hair_dryer = as.factor('f'),
                  essentials = as.factor('t'), washer = as.factor('t'), 
                  bed_linens = as.factor('t'), refrigerator = as.factor('t'),
                  hot_water = as.factor('t'), oven = as.factor('f'), 
                  wifi = as.factor('t'), cooking_basics = as.factor('t'), 
                  long_term_stays_allowed = as.factor('t'),
                  dedicated_workspace = as.factor('t'), elevator = as.factor('f'), 
                  hangers = as.factor('t'), coffee_maker = as.factor('t'), 
                  carbon_monoxide_alarm = as.factor('f'), 
                  smoke_alarm = as.factor('f'), other = as.factor('t'), 
                  microwave = as.factor('t'), air_conditioning = as.factor('f'), 
                  free_street_parking = as.factor('t'), dryer = as.factor('t'),
                  fire_extinguisher = as.factor('f'), 
                  extra_pillows_and_blankets = as.factor('f'), tv = as.factor('t'),
                  cable_tv = as.factor('f'), first_aid_kit = as.factor('f'),
                  private_entrance = as.factor('t'), 
                  luggage_dropoff_allowed = as.factor('f'), 
                  free_parking_on_premises = as.factor('t'), 
                  stove = as.factor('t'), host_greets_you = as.factor('f'), 
                  patio_or_balcony = as.factor('f'), 
                  host_since_year = as.factor(2015))

predict(final_fit, new_data = abnb_test4_2)

We get 92.5, which is about four points from the actual rating! Omissions like this might account for predictions that run lower than the actual listing reviews, since the presence of these amenities increases our predicted score.


Conclusion

Our look into predicting Airbnb ratings has certainly yielded interesting results. We looked into a few models, many of which performed fairly poorly on the training set, but we were able to persevere and train a random forest that gave us an R-squared (rsq) of 70.5%! Of course, there is plenty of room for improvement, and we have a few places where we can start looking.

We noticed earlier that our model was unable to predict the ratings of poorly rated Airbnbs. Further exploration into the reasons behind this could be worthwhile, since resolving the issue might yield a higher rsq. One possible solution is adding review counts to the data to see whether our models would perform differently: some of the properties we looked at had very small sample sizes, and a single review is a poor indicator of the overall review scores rating. Another change to consider is a better balance of review scores ratings. We noticed in our EDA that there are only a few poorly rated Airbnbs, which makes it easier for our model to predict a high rating regardless of what the true rating may be. Though my access to data is limited, it might be a good idea to implement both of these changes on a new dataset!
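As one hedged sketch of the rebalancing idea (the bin edges and per-bin cap here are illustrative, and the toy frame stands in for our full dataset), the ratings could be binned and the over-represented bins downsampled:

```r
set.seed(123)
# Toy data mimicking the imbalance: many high ratings, few low ones
abnb_toy <- data.frame(review_scores_rating = c(rep(95, 50), rep(40, 5)))

# Bin the ratings, then cap each bin at the same number of rows
abnb_toy$bin <- cut(abnb_toy$review_scores_rating,
                    breaks = c(0, 60, 80, 100))
balanced <- do.call(rbind, lapply(split(abnb_toy, abnb_toy$bin, drop = TRUE),
                                  function(d) d[sample(nrow(d), min(nrow(d), 5)), ]))
table(balanced$bin)  # at most 5 rows per bin
```

This trades away training data for balance, so it would need to be validated against the un-balanced model.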

In the end, this project provided us with some useful knowledge about cleaning data and building models, and we can walk away happily with some useful predictions and a small glimpse into the world of Airbnb rentals and customer satisfaction.